We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an `encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a `decoder', which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (`wordpieces'), which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks, and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
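The encoder/decoder/joint structure the abstract describes can be summarized in a minimal sketch. The PyTorch module below is an illustrative assumption, not the authors' implementation: the layer widths, joint-network size, and all parameter names are stand-ins; only the overall structure (an acoustic `encoder', a label-history `decoder', and a joint network over the time-by-label lattice) and the twelve-layer/two-layer LSTM depths follow the text.

```python
# Minimal sketch of an RNN-T model, assuming illustrative layer sizes.
# In the paper, the encoder is initialized from a CTC acoustic model and
# the decoder from an RNN language model; those steps are omitted here.
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    def __init__(self, num_labels, feat_dim=80, enc_layers=12,
                 dec_layers=2, hidden=640, joint_dim=512):
        super().__init__()
        # Encoder: plays the role of the acoustic model
        # (twelve LSTM layers in the paper's best system).
        self.encoder = nn.LSTM(feat_dim, hidden, enc_layers, batch_first=True)
        # Decoder (prediction network): plays the role of a language model
        # over the output units (graphemes or wordpieces); +1 for blank.
        self.embed = nn.Embedding(num_labels + 1, hidden)
        self.decoder = nn.LSTM(hidden, hidden, dec_layers, batch_first=True)
        # Joint network: combines encoder and decoder states and predicts
        # a distribution over labels plus blank at each (t, u) grid point.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, joint_dim), nn.Tanh(),
            nn.Linear(joint_dim, num_labels + 1))

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)               # (B, T, H)
        dec, _ = self.decoder(self.embed(labels))  # (B, U, H)
        # Broadcast both over the (T, U) lattice the RNN-T loss sums over.
        t = enc.unsqueeze(2).expand(-1, -1, dec.size(1), -1)
        u = dec.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))  # (B, T, U, L+1)
```

The resulting (B, T, U, L+1) logits can then be trained with an RNN-T loss implementation (for example, `torchaudio.functional.rnnt_loss`), which marginalizes over all monotonic alignments of the label sequence to the acoustic frames.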